Abstract

Memory becomes a limiting factor in contemporary applications, such
as analyses of the Webgraph and molecular sequences, when many objects
need to be counted simultaneously. Robert Morris [Communications of
the ACM, 21:840-842, 1978] proposed a probabilistic technique for
approximate counting that is extremely space-efficient. The basic idea is to
increment a counter containing the value $X$ with probability $2^{-X}$. As a
result, the counter contains an approximation of $\lg n$ after $n$ probabilistic
updates, stored in $\lg\lg n$ bits. Here we revisit the original idea of Morris,
and introduce a binary floating-point counter that uses a $d$-bit significand
in conjunction with a binary exponent. The counter yields a simple
formula for an unbiased estimation of $n$ with a standard deviation of about
$0.6 \cdot n 2^{-d/2}$, and uses $d + \lg\lg n$ bits.

We analyze the floating-point counter's performance in a general frame- 
work that applies to any probabilistic counter, and derive practical for- 
mulas to assess its accuracy. 
1 Introduction

An elementary information-theoretic argument shows that $\lceil \lg(n+1) \rceil$ bits are
necessary to represent integers between 0 and $n$ ($\lg$ denotes binary logarithm
throughout the paper). Counting some interesting objects in a data set thus
takes logarithmic space. Certain applications need to be more economical
because they need to maintain many counters simultaneously while, say, tracking
patterns in large data streams. Notable examples where memory becomes a
limiting factor include analyses of the Webgraph [1, 3]. Numerous bioinformatics
studies also require space-efficient solutions when searching for recurrent
motifs in protein and DNA sequences. These frequent sequence motifs are
associated with mobile, structural, regulatory or other functional elements, and have
been studied since the first molecular sequences became available [6]. Some
recent studies have concentrated on patterns involving long oligonucleotides, i.e.,
"words" of length 16-40 over the 4-letter DNA alphabet, revealing potentially
novel regulatory features [5, 10], and general characteristics of copying processes
in genome evolution [11]. Hashtable-based indexing techniques [9] used in
homology search and genome assembly procedures also rely on counting in order to
identify repeating sequence patterns. In these applications, billions of counters
need to be handled, making implementations difficult in mainstream computing
environments. The need for many counters is aggravated by the fact that
the counted features often have heavy-tailed frequency distributions [2, 3, 11],
and there is thus no "typical" size for individual counters that could guide the
memory allocation at the outset. As a numerical example, consider a study [5]
of the 16-mer distribution in the human genome sequence, which has a length
surpassing three billion. More than four billion ($4^{16}$) different words need to
be counted, and the counter values span more than sixteen binary magnitudes
even though the average 16-mer occurs only once or twice.

One way to greatly reduce memory usage is to relax the requirement of exact
counts. Namely, approximate counting to $n$ is possible using $\lg\lg n + O(1)$
bits with probabilistic techniques [4, 8]. The idea of probabilistic counting
was introduced by Morris [8]. In the simplest case, a counter is initialized as
$X = 0$. The counter is incremented by one at the occurrence of an event with
probability $2^{-X}$. The counter is meant to track the magnitude of the true
number of events. More precisely, after $n$ events, the expected value of $2^X$ is
exactly $(n + 1)$.
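
The expectation claim can be checked without simulation by propagating the exact distribution of the counter. The following is a minimal sketch, not part of the original method description: the function names `morris_distribution` and `expected_power_of_two` are hypothetical, and the use of exact rational arithmetic is an implementation choice made for the illustration.

```python
from fractions import Fraction

def morris_distribution(n):
    """Exact distribution of the binary Morris counter X after n events.

    Returns a list p with p[k] = P{X = k}. The counter starts at X = 0 and
    moves from k to k + 1 with probability 2**(-k).
    """
    p = [Fraction(1)]                 # before any event, X = 0 with certainty
    for _ in range(n):
        nxt = [Fraction(0)] * (len(p) + 1)
        for k, pk in enumerate(p):
            q = Fraction(1, 2 ** k)   # increment probability in state k
            nxt[k] += (1 - q) * pk    # the counter stays at k
            nxt[k + 1] += q * pk      # the counter moves to k + 1
        p = nxt
    return p

def expected_power_of_two(n):
    """E[2**X] after n events, computed exactly."""
    return sum(pk * 2 ** k for k, pk in enumerate(morris_distribution(n)))
```

Under these assumptions, `expected_power_of_two(n)` should equal `n + 1` for every `n`, since each event raises the expected value of $2^X$ by exactly one.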

A generalization of the binary Morris counter is the so-called q-ary counter
with some $r \ge 1$ and $q = 2^{1/r}$. In such a setup, the counter is incremented
with probability $q^{-X}$. The actual event count is estimated as $f(X)$, using the
transformation

$$f(x) = \frac{q^x - 1}{q - 1}.$$

The function $f$ yields an unbiased estimate, as $\mathbb{E} f(X) = n$ after $n$ probabilistic
updates. The accuracy of a probabilistic counting method is characterized by
the variance of the estimated count. For the q-ary counter,

$$\operatorname{Var} f(X) = (q-1)\,\frac{n(n-1)}{2}, \tag{1}$$

which is approximately $\frac{\ln 2}{2r}\, n^2$ for large $n$ and $r$. The parameter $r$ governs the
tradeoff between memory usage and accuracy. The counter stores $X$ (with
$n = f(X)$) using $\lg r + \lg\lg n + o(1)$ bits; larger $r$ thus increases the accuracy at
the expense of higher storage costs.

The main goal of this study is to introduce a novel algorithm for approximate
counting. Our floating-point counter is defined with the aid of a design
parameter $M = 2^d$, where $d$ is a nonnegative integer. As we discuss later, $M$
determines the tradeoff between memory usage and accuracy, analogously to
parameter $r$ of the q-ary counter. The procedure relies on a uniform random bit
generator RandomBit(). Algorithm FP-Increment below shows the incrementation
procedure for a floating-point counter, initialized with $X = 0$. Notice that
the first $M$ updates are deterministic.



    FP-Increment(X)                      // returns new value of X
      set t ← ⌊X/M⌋                      // bitwise right shift by d positions
      while t > 0 do
          if RandomBit() = 1 then return X
          set t ← t - 1
      return X + 1





The counter value $X = 2^d \cdot t + u$, where $u$ denotes the lower $d$ bits, is used to
estimate the actual count as $f(X) = (M + u) \cdot 2^t - M$. The counter thus stores $X$
using $d + \lg\lg n + o(1)$ bits. The estimate's standard deviation is $\frac{c}{\sqrt{M}}\, n$, where $c$
fluctuates between about 0.58 and 0.61 asymptotically (see Corollary 7 for a
precise characterization). Notice that a q-ary counter with $r = M$ has
asymptotically the same memory usage, and a standard deviation of about $\frac{0.59}{\sqrt{M}}\, n$ (see
Eq. (1)). Our algorithm thus has memory usage and accuracy similar to those of
q-ary counting. The floating-point counter is more advantageous in two aspects.
First, the first $M$ updates are deterministic, i.e., small values are represented
exactly. Second, the counter can be implemented with a few
elementary integer and bitwise operations, whereas a q-ary counter works with
irrational probabilities. The random updates in the floating-point counter occur
with exact integer powers $2^{-t}$, and such random values can be generated using
an average of 2 random bits. Specifically, the FP-Increment procedure uses an
expected number of $(2 - 2^{1-t})$ calls to the random bit generator RandomBit().
In contrast, a q-ary counter needs a uniform random number in the range (0, 1)
to produce a random event with probability $2^{-X/r}$.
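
The FP-Increment procedure translates almost line by line into integer and bitwise operations. The sketch below is one possible rendering in Python, under the assumption that `random.getrandbits(1)` plays the role of RandomBit(); the parameter name `random_bit` and the helper `expected_random_bits` are hypothetical and serve only to make the bit source replaceable and the expected cost explicit.

```python
import random

def fp_increment(x, d, random_bit=lambda: random.getrandbits(1)):
    """One probabilistic update of a floating-point counter with M = 2**d.

    Returns the new counter value: x + 1 with probability 2**(-t), where
    t = x >> d is the exponent part, and x otherwise.
    """
    t = x >> d                  # exponent: bitwise right shift by d positions
    while t > 0:
        if random_bit() == 1:   # a single 1 bit aborts the update
            return x
        t -= 1
    return x + 1                # reached only after t consecutive 0 bits

def expected_random_bits(t):
    """Expected number of random_bit() calls when the exponent equals t."""
    return 2 - 2.0 ** (1 - t)
```

While `x < 2**d` the exponent is zero, so the loop body never runs and the update is deterministic, in line with the remark that the first $M$ updates are exact. Passing a deterministic `random_bit` callable is convenient for testing.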

The rest of the paper is organized as follows. In order to quantify the
performance of floating-point counters, we found it fruitful to develop a general
analysis of probabilistic counting, which is of independent mathematical
interest. Section 2 presents the main results about the accuracy of probabilistic
counting methods. First, Theorem 1 shows that every probabilistic counting
method has a unique unbiased estimator $f$ with $\mathbb{E} f(X) = n$ after $n$ probabilistic
updates. Second, Theorem 2 shows that the accuracy of any such method is
computable directly from the counter value. Finally, Theorem 3 gives relatively
simple upper and lower bounds on the asymptotic accuracy of the unbiased
estimator. The proofs of the theorems are given in Section 3, which can be
safely skipped on first reading. Section 4 presents floating-point counters in
detail, and mathematically characterizes their utility by relying on the results of
Section 2. It further illustrates the theoretical analyses with simulation
experiments comparing q-ary and floating-point counters.

2 Probabilistic counting 

For a formal discussion of probabilistic counting, consider the Markov chain 
formed by the successive counter values. 

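Definition 1. A counting chain is a Markov chain $(X_n : n = 0, 1, \dots)$ with

$$X_0 = 0; \tag{2a}$$
$$\mathbb{P}\{X_{n+1} = k+1 \mid X_n = k\} = q_k; \tag{2b}$$
$$\mathbb{P}\{X_{n+1} = k \mid X_n = k\} = 1 - q_k, \tag{2c}$$

where $0 < q_k \le 1$ are the transition probabilities defining the counter.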

It is a classic result associated with probabilities in pure-birth processes [7]
that the n-step probabilities $p_n(k) = \mathbb{P}\{X_n = k\}$ are computable by a simple
recurrence (see Equations (8a) and (8b) later). In case of probabilistic counting, we
want to infer $n$ from the value of $X_n$ alone through a computable function $f$.
A given probabilistic counting method is defined by the transition probabilities
and the function $f$. As we will see later (Theorem 1), the transition probabilities
determine a unique function $f$ that gives an unbiased estimate of the update
count $n$.

Definition 2. A function $f \colon \mathbb{N} \to \mathbb{R}$ is an unbiased count estimator for a given
counting chain if and only if $\mathbb{E} f(X_n) = n$ holds for all $n = 0, 1, \dots$.

In the upcoming discussions, we assume that the probabilistic counting 
method uses an unbiased count estimator /. The merit of a given method 
is gauged by its accuracy, as defined below. 

Definition 3. The accuracy of the counter is the coefficient of variation

$$A_n = \frac{\sqrt{\operatorname{Var} f(X_n)}}{\mathbb{E} f(X_n)}.$$



The theorems below provide an analytical framework for evaluating
probabilistic counters. Theorem 1 shows that the unbiased estimator is uniquely
defined by a relatively simple expression involving the transition probabilities.
Theorem 2 shows that the uncertainty of the estimate can be determined directly
from the counter value. Theorem 3 gives a practical bound on the asymptotic
accuracy of the counter.

Theorem 1. The function

$$f(0) = 0 \tag{3a}$$
$$f(k) = \frac{1}{q_0} + \frac{1}{q_1} + \cdots + \frac{1}{q_{k-1}} \qquad \{k > 0\} \tag{3b}$$

uniquely defines the unbiased count estimator $f$ for any given set of transition
probabilities $\{q_k : k = 0, 1, \dots\}$. Thus, for any given counting chain, we can
efficiently determine an unbiased estimator.
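
Theorem 1 can be checked numerically for small cases by combining the estimator (3) with the exact n-step distribution of the chain. The sketch below is illustrative only: the function names are hypothetical, and the transition probabilities are assumed to be supplied as a list `q` of exact fractions with at least `n` entries.

```python
from fractions import Fraction

def unbiased_estimator(q, k):
    """f(k) = 1/q[0] + ... + 1/q[k-1], with f(0) = 0 (Theorem 1)."""
    return sum(1 / Fraction(q[i]) for i in range(k))

def n_step_distribution(q, n):
    """p[k] = P{X_n = k} for the counting chain with probabilities q[k]."""
    p = [Fraction(1)]                      # X_0 = 0 with certainty
    for _ in range(n):
        nxt = [Fraction(0)] * (len(p) + 1)
        for k, pk in enumerate(p):
            nxt[k] += (1 - q[k]) * pk      # stay in state k
            nxt[k + 1] += q[k] * pk        # advance to state k + 1
        p = nxt
    return p

def expected_estimate(q, n):
    """E f(X_n), which Theorem 1 states is exactly n."""
    p = n_step_distribution(q, n)
    return sum(pk * unbiased_estimator(q, k) for k, pk in enumerate(p))
```

For instance, calling `expected_estimate` with an arbitrary list of probabilities in $(0, 1]$ should return `n` exactly, regardless of how the probabilities were chosen.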

Theorem 1 confirms the intuition that the transition probabilities must be
exponentially decreasing in order to achieve storage in $\lg\lg n + O(1)$ bits.
Otherwise, with subexponential $q_k^{-1} = 2^{o(k)}$, one would have $f(k) = 2^{o(k)}$, leading
to $\lg n = o(k)$.

The next definition provides a computable function for quantifying the
uncertainty of $f(X)$.

Definition 4. The variance function for a given counting chain is defined by

$$g(0) = 0 \tag{4a}$$
$$g(k) = \frac{1 - q_0}{q_0^2} + \frac{1 - q_1}{q_1^2} + \cdots + \frac{1 - q_{k-1}}{q_{k-1}^2} \qquad \{k > 0\} \tag{4b}$$

Theorem 2 below shows that the accuracy is computable directly from the
counter value for any counting chain. The statement has practical relevance
(since count estimates can be coupled with the variance function's value), and
the variance function is used to evaluate the asymptotic accuracy of any counting
chain (see Theorem 3).

Theorem 2. The variance function $g$ of Definition 4 provides an unbiased
estimate for the variance of $f$ from Theorem 1. Specifically,

$$\operatorname{Var} f(X_n) = \mathbb{E} g(X_n) \tag{5}$$

holds for all $n \ge 0$, where the moments refer to the space of n-step probabilities.
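
The identity (5) lends itself to an exact check on small chains. The sketch below tabulates $f$ and $g$ from their defining sums and compares the variance of $f(X_n)$ with the expectation of $g(X_n)$; the function name `check_variance_identity` is hypothetical, and `q` is assumed to be a list of exact fractions with at least `n` entries.

```python
from fractions import Fraction

def check_variance_identity(q, n):
    """Return (Var f(X_n), E g(X_n)), computed exactly for the chain q."""
    # Tabulate f and g by their defining sums (3) and (4).
    f, g = [Fraction(0)], [Fraction(0)]
    for k in range(n):
        qk = Fraction(q[k])
        f.append(f[-1] + 1 / qk)
        g.append(g[-1] + (1 - qk) / qk ** 2)
    # Propagate the n-step probabilities p[k] = P{X_n = k}.
    p = [Fraction(1)]
    for _ in range(n):
        nxt = [Fraction(0)] * (len(p) + 1)
        for k, pk in enumerate(p):
            nxt[k] += (1 - q[k]) * pk
            nxt[k + 1] += q[k] * pk
        p = nxt
    mean = sum(pk * f[k] for k, pk in enumerate(p))
    var = sum(pk * f[k] ** 2 for k, pk in enumerate(p)) - mean ** 2
    exp_g = sum(pk * g[k] for k, pk in enumerate(p))
    return var, exp_g
```

For the binary Morris counter (`q[k] = 2**(-k)`), both returned values should also agree with $n(n-1)/2$, which is Eq. (1) with $q = 2$.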

Theorem 3 is the last main result of this section. The statement relates
the asymptotics of the variance function, the unbiased count estimator, and the
counting chain's accuracy.

Theorem 3. Let $A_n$ be the accuracy of Definition 3, and let

$$B_k = \frac{\sqrt{g(k)}}{f(k)}. \tag{6}$$

Let $\liminf_{k\to\infty} B_k = \mu$. Suppose that $\limsup_{k\to\infty} B_k = \lambda < 1$ (and, thus,
$\mu < 1$). Then

$$\limsup_{n\to\infty} A_n \le \frac{\lambda}{\sqrt{1 - \lambda^2}} \tag{7a}$$
$$\liminf_{n\to\infty} A_n \ge \frac{\mu}{\sqrt{1 - \mu^2}} \tag{7b}$$

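Example. Consider the case of a q-ary counter, where $q_k = q^{-k}$ with some
$q > 1$. Theorem 1 automatically gives the unbiased count estimator

$$f(k) = \sum_{i=0}^{k-1} q^{i} = \frac{q^k - 1}{q - 1}.$$

Theorem 2 yields the variance function

$$g(k) = \sum_{i=0}^{k-1} \left(q^{2i} - q^{i}\right) = \frac{q^{2k} - 1}{q^2 - 1} - \frac{q^k - 1}{q - 1}.$$

In order to use Theorem 3, observe that

$$\lambda^2 = \lim_{k\to\infty} \frac{g(k)}{f^2(k)} = \frac{q - 1}{q + 1} < 1.$$

Therefore, we obtain the known result [3] that $\lim_{n\to\infty} A_n^2 = \frac{\lambda^2}{1 - \lambda^2} = \frac{q - 1}{2}$.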
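
The closed forms in this example can be compared against the defining sums, and the limit of $g(k)/f^2(k)$ can be observed numerically. The following sketch uses hypothetical function names and ordinary floating-point arithmetic, so agreement is only expected up to rounding error.

```python
def qary_f(q, k):
    """Closed form of the unbiased estimator for q_k = q**(-k)."""
    return (q ** k - 1) / (q - 1)

def qary_g(q, k):
    """Closed form of the variance function for q_k = q**(-k)."""
    return (q ** (2 * k) - 1) / (q * q - 1) - (q ** k - 1) / (q - 1)

def qary_g_by_sum(q, k):
    """Variance function from Definition 4: sum of (1 - q_i) / q_i**2."""
    return sum(q ** (2 * i) - q ** i for i in range(k))

def variance_ratio(q, k):
    """g(k) / f(k)**2, which tends to (q - 1) / (q + 1) as k grows."""
    return qary_g(q, k) / qary_f(q, k) ** 2
```

With, say, `q = 2 ** 0.25` (that is, $r = 4$) and a large `k`, `variance_ratio(q, k)` should be close to `(q - 1) / (q + 1)`.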

3 Proofs 

In what follows, we use the shorthand notation

$$p_n(k) = \mathbb{P}\{X_n = k\}$$

for the n-step probabilities. By (2), $p_0(0) = 1$, and the recurrences

$$p_{n+1}(0) = (1 - q_0)\, p_n(0) \tag{8a}$$
$$p_{n+1}(k) = (1 - q_k)\, p_n(k) + q_{k-1}\, p_n(k-1) \qquad \{k > 0\} \tag{8b}$$

hold for all $n \ge 0$.

Lemma 4. The unbiased estimator is unique.

Proof. Since $\mathbb{E} f(X_0) = 0$ is imposed, and $X_0 = 0$ with certainty, $f(0) = 0$. For
all $n$, $\mathbb{P}\{X_n > n\} = 0$, so

$$\mathbb{E} f(X_n) = \sum_{k=0}^{n} p_n(k)\, f(k).$$

Thus, for all $n > 0$,

$$f(n) = \frac{n - \sum_{k=0}^{n-1} p_n(k)\, f(k)}{p_n(n)} = \frac{n - \sum_{k=0}^{n-1} p_n(k)\, f(k)}{q_0 q_1 \cdots q_{n-1}},$$

which shows that $f(n)$ is uniquely determined by $f(0), \dots, f(n-1)$ and the
n-step probabilities. □

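Proof of Theorem 1. Define the durations $L_k(n) = \sum_{i=0}^{n-1} \{X_i = k\}$: the
number of times $X_i = k$ for $i < n$. Define also $L_k = \lim_{n\to\infty} L_k(n) =
\sum_{i=0}^{\infty} \{X_i = k\}$. Clearly, $\mathbb{E} L_k = 1/q_k$. By the linearity of expectations,

$$\mathbb{E} L_k = \mathbb{E} L_k(n) + \mathbb{E} \sum_{i=n}^{\infty} \{X_i = k\}$$
$$= \mathbb{E} L_k(n) + \mathbb{E}\Bigl[\sum_{i=n}^{\infty} \{X_i = k\} \Bigm| X_n \le k\Bigr]\, \mathbb{P}\{X_n \le k\}$$
$$= \mathbb{E} L_k(n) + \mathbb{P}\{X_n \le k\}\, \mathbb{E} L_k,$$

where we used the memoryless property of the geometric distribution in the last
step. Consequently,

$$\mathbb{E} L_k(n) = \frac{\mathbb{P}\{X_n > k\}}{q_k}. \tag{9}$$

Now,

$$\sum_{k=0}^{\infty} \mathbb{E} L_k(n) = \sum_{k=0}^{\infty} \mathbb{P}\{X_n > k\}\, \frac{1}{q_k} = \sum_{k=0}^{n} p_n(k) \sum_{i=0}^{k-1} \frac{1}{q_i} = \sum_{k=0}^{n} p_n(k)\, f(k) = \mathbb{E} f(X_n).$$

Since $\sum_{k=0}^{\infty} L_k(n) = n$, we have $\mathbb{E} f(X_n) = n$ for all $n$. By Lemma 4, no other
function $f$ has the same property. □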

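Proof of Theorem 2. By (8), for all $n \ge 0$,

$$\mathbb{E} f^2(X_{n+1}) = \sum_{k=0}^{n+1} p_{n+1}(k)\, f^2(k)$$
$$= \sum_{k=0}^{n} (1 - q_k)\, p_n(k)\, f^2(k) + \sum_{k=1}^{n+1} q_{k-1}\, p_n(k-1)\, f^2(k)$$
$$= \mathbb{E} f^2(X_n) - \sum_{k=0}^{n} q_k\, p_n(k)\, f^2(k) + \sum_{k=1}^{n+1} q_{k-1}\, p_n(k-1) \bigl(f(k-1) + q_{k-1}^{-1}\bigr)^2$$
$$= \mathbb{E} f^2(X_n) + 2 \sum_{k=0}^{n} p_n(k)\, f(k) + \sum_{k=0}^{n} p_n(k)\, q_k^{-1}$$
$$= \mathbb{E} f^2(X_n) + 2n + \sum_{k=0}^{n} p_n(k)\, q_k^{-1}.$$

Since $\operatorname{Var} f(X_n) = \mathbb{E} f^2(X_n) - \bigl(\mathbb{E} f(X_n)\bigr)^2 = \mathbb{E} f^2(X_n) - n^2$,

$$\operatorname{Var} f(X_{n+1}) = \operatorname{Var} f(X_n) + \sum_{k=0}^{n} p_n(k)\, q_k^{-1} - 1. \tag{10}$$

By (4) and (8),

$$\mathbb{E} g(X_{n+1}) = \sum_{k=0}^{n+1} p_{n+1}(k)\, g(k)$$
$$= \mathbb{E} g(X_n) - \sum_{k=0}^{n} q_k\, p_n(k)\, g(k) + \sum_{k=1}^{n+1} q_{k-1}\, p_n(k-1) \Bigl(g(k-1) + \frac{1 - q_{k-1}}{q_{k-1}^2}\Bigr)$$
$$= \mathbb{E} g(X_n) + \sum_{k=0}^{n} p_n(k)\, \frac{1 - q_k}{q_k}$$
$$= \mathbb{E} g(X_n) + \sum_{k=0}^{n} p_n(k)\, q_k^{-1} - 1.$$

By (10), $\operatorname{Var} f(X_{n+1}) - \operatorname{Var} f(X_n) = \mathbb{E} g(X_{n+1}) - \mathbb{E} g(X_n)$ holds for all $n \ge 0$.
Since $\operatorname{Var} f(X_0) = \mathbb{E} g(X_0) = 0$, $\operatorname{Var} f(X_n) = \mathbb{E} g(X_n)$ holds for all $n$. □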

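Proof of Theorem 3. Define

$$W_n = \frac{\operatorname{Var} f(X_n)}{\mathbb{E} f^2(X_n)} = \frac{\sum_{k=0}^{\infty} p_n(k)\, g(k)}{\sum_{k=0}^{\infty} p_n(k)\, f^2(k)}.$$

Let $\epsilon > 0$ be an arbitrary threshold. By the definition of $\lambda$, there exists $K$
such that

$$\frac{g(k)}{f^2(k)} \le (1 + \epsilon)\lambda^2$$

for all $k > K$. Therefore,

$$W_n = \frac{\sum_{k=0}^{K} p_n(k)\, g(k) + \sum_{k>K} p_n(k)\, g(k)}{\sum_{k=0}^{K} p_n(k)\, f^2(k) + \sum_{k>K} p_n(k)\, f^2(k)}$$
$$\le \frac{\sum_{k=0}^{K} p_n(k)\, g(k) + (1 + \epsilon)\lambda^2 \sum_{k>K} p_n(k)\, f^2(k)}{\sum_{k>K} p_n(k)\, f^2(k)}$$
$$= (1 + \epsilon)\lambda^2 + \frac{\sum_{k=0}^{K} p_n(k)\, g(k)}{\sum_{k>K} p_n(k)\, f^2(k)}.$$

Since $q_k > 0$ for all $k$, $\lim_{n\to\infty} p_n(k) = 0$ for all $k$. Consequently,
$\lim_{n\to\infty} \sum_{k=0}^{K} p_n(k)\, g(k) = 0$. As $\lim_{n\to\infty} \sum_{k>K} p_n(k)\, f^2(k) = \infty$, there exists $N$ such that

$$W_n \le (1 + 2\epsilon)\lambda^2 \quad \text{for all } n > N. \tag{11}$$

Since $\operatorname{Var} f(X_n) = \mathbb{E} f^2(X_n) - \mathbb{E}^2 f(X_n)$,

$$W_n = \frac{\operatorname{Var} f(X_n)}{\operatorname{Var} f(X_n) + n^2}.$$

By (11), $\frac{\operatorname{Var} f(X_n)}{\operatorname{Var} f(X_n) + n^2} \le (1 + 2\epsilon)\lambda^2$ for all $n > N$. So,

$$\frac{\operatorname{Var} f(X_n)}{n^2} \le \frac{(1 + 2\epsilon)\lambda^2}{1 - (1 + 2\epsilon)\lambda^2} = -1 + \frac{1}{1 - (1 + 2\epsilon)\lambda^2}.$$

Since $\epsilon$ is arbitrarily small and $\lambda^2 < 1$,

$$\limsup_{n\to\infty} \frac{\operatorname{Var} f(X_n)}{n^2} \le \frac{\lambda^2}{1 - \lambda^2}.$$

The lower bound is proven analogously. Let $\epsilon > 0$ be an arbitrary threshold.
Let $K$ be such that $\frac{g(k)}{f^2(k)} \ge (1 - \epsilon)\mu^2$ for all $k > K$. So,

$$W_n \ge (1 - \epsilon)\mu^2\, \frac{\sum_{k>K} p_n(k)\, f^2(k)}{\sum_{k=0}^{K} p_n(k)\, f^2(k) + \sum_{k>K} p_n(k)\, f^2(k)}.$$

For $n$ large enough, $W_n \ge (1 - 2\epsilon)\mu^2$ holds. Since $\epsilon$ is arbitrarily small, and
$\mu^2 \le \lambda^2 < 1$,

$$\liminf_{n\to\infty} \frac{\operatorname{Var} f(X_n)}{n^2} \ge \frac{\mu^2}{1 - \mu^2}.$$

□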

4 Floating-point counters 

The counting chain for a floating-point counter is defined using a design
parameter $M = 2^d$ with some nonnegative integer $d$:

$$\mathbb{P}\{X_{n+1} = k+1 \mid X_n = k\} = 2^{-\lfloor k/M \rfloor}; \tag{12a}$$
$$\mathbb{P}\{X_{n+1} = k \mid X_n = k\} = 1 - 2^{-\lfloor k/M \rfloor}. \tag{12b}$$

Figure 1 illustrates the states of the floating-point counter. The counter's
designation becomes apparent from examining the binary representation of the
counter value $k$. Write $k = Mt + u$ with

$$t = \lfloor k/M \rfloor, \qquad u = k \bmod M;$$

i.e., $u$ corresponds to the lower $d$ bits of $k$, and $t$ corresponds to the remaining
upper bits. Theorem 1 applies with $q_k = 2^{-\lfloor k/M \rfloor}$, leading to the following
Corollary.
[Figure 1 (state diagram): deterministic updates ($k = 0, 1, \dots, M-1$) pass through the states (0,0), (0,1), ..., (0,M-1); probabilistic updates ($k = M, M+1, \dots$) follow, with successive rows of states advancing with probability 1/2, 1/4, 1/8, etc.]

Figure 1: States of the counting Markov chain. Each state is labeled with a 
pair (t, u), where {u + M) are the most significant digits and t is the number of 
trailing zeros for the true count. 



Corollary 5. The unbiased estimator for $k = Mt + u$ is

$$f(k) = f(t, u) = (M + u)\, 2^t - M. \tag{13}$$



In other words, (t, u) is essentially a floating-point representation of the true 
count n, where t is the exponent, and u is a d-bit significand without the hidden 
bit for the leading '1.' 
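
The closed form of the estimator can be compared with the defining sum of Theorem 1, using $q_i = 2^{-\lfloor i/M \rfloor}$. The sketch below is an illustration with hypothetical function names; it relies only on integer shifts and masks to split the counter value into exponent and significand.

```python
def fp_estimate_closed_form(k, d):
    """Estimate (M + u) * 2**t - M for counter value k, with M = 2**d."""
    m = 1 << d
    t = k >> d           # exponent: the upper bits of k
    u = k & (m - 1)      # significand: the lower d bits of k
    return (m + u) * 2 ** t - m

def fp_estimate_by_sum(k, d):
    """Same estimate as the sum of 1/q_i = 2**floor(i/M) over i < k."""
    return sum(2 ** (i >> d) for i in range(k))
```

For example, with `d = 3` the counter value `k = 17` splits into `t = 2` and `u = 1`, and both functions give `(8 + 1) * 4 - 8`.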

Theorem 2 yields the following Corollary.

Corollary 6. The variance function for the floating-point counter is

$$g(k) = g(t, u) = \Bigl(\frac{M}{3} + u\Bigr) 4^t - (M + u)\, 2^t + \frac{2}{3} M. \tag{14}$$

Combining Corollaries 5 and 6, we get the following bounds.

Corollary 7. The accuracy of the floating-point counter is asymptotically bounded
as

$$\limsup_{n\to\infty} A_n \le \sqrt{\frac{3}{8M - 3}}, \qquad \liminf_{n\to\infty} A_n \ge \sqrt{\frac{1}{3M - 1}}.$$

Proof. By Equations (13) and (14), we have

$$\lim_{t\to\infty} \frac{g(t, u)}{f^2(t, u)} = \frac{\frac{M}{3} + u}{(M + u)^2}.$$

Considering the extreme values at $u = 0$ and $u = M/3$, respectively:

$$\mu^2 = \liminf_{k\to\infty} \frac{g(k)}{f^2(k)} = \frac{1}{3} M^{-1}, \qquad \lambda^2 = \limsup_{k\to\infty} \frac{g(k)}{f^2(k)} = \frac{3}{8} M^{-1}. \tag{15}$$

Plugging these limits into Theorem 3 leads to the Corollary. □

Figure 2: Error trajectories for floating-point counters (top) and q-ary counters
(bottom). Each trajectory follows the appropriate counting chain in a
random simulated run. The lines trace the relative error $(f(X_n) - n)/n$ for
floating-point counters with $d$-bit mantissa, and comparable q-ary counters with
$q = 2^{1/r}$ where $r = 2^d$. The shaded areas indicate a relative error of $\pm 0.59 \cdot 2^{-d/2}$.
The dots at the end of the trajectories denote the final value for $n = 100000$.
For large $M = 2^d$, the bounds of Corollary 7 become

$$\limsup_{n\to\infty} A_n \le 2^{-d/2} \sqrt{3/8} \approx 0.612 \cdot 2^{-d/2}$$
$$\liminf_{n\to\infty} A_n \ge 2^{-d/2} \sqrt{1/3} \approx 0.577 \cdot 2^{-d/2}.$$

The accuracy is thus comparable to the accuracy of a q-ary counter with $q =
2^{1/M}$, which is approximately $2^{-d/2} \sqrt{0.5 \cdot \ln 2} \approx 0.589 \cdot 2^{-d/2}$. The memory
requirements of the two counters are equivalent: in order to count up to $n =
f(k)$, $\lg k = d + \lg\lg n + o(1)$ bits are necessary.
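
The bounds of Corollary 7 and the limiting ratio used in its proof are simple enough to tabulate for a given mantissa width. The sketch below uses hypothetical function names and floating-point arithmetic; it evaluates the exact bounds for finite $M$ rather than the large-$M$ approximations.

```python
import math

def fp_accuracy_bounds(d):
    """(lower, upper) asymptotic bounds on A_n from Corollary 7, M = 2**d."""
    m = 2 ** d
    return math.sqrt(1 / (3 * m - 1)), math.sqrt(3 / (8 * m - 3))

def fp_variance_ratio_limit(u, d):
    """Limit of g(t, u) / f(t, u)**2 as t grows: (M/3 + u) / (M + u)**2."""
    m = 2 ** d
    return (m / 3 + u) / (m + u) ** 2

def qary_accuracy(d):
    """Approximate accuracy sqrt(0.5 * ln 2 / M) of the comparable q-ary counter."""
    return math.sqrt(0.5 * math.log(2) / 2 ** d)
```

For a given `d`, the value returned by `qary_accuracy(d)` can be compared directly with the interval returned by `fp_accuracy_bounds(d)`; scaling all three by `2 ** (d / 2)` recovers constants close to those quoted above when `d` is large.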

Figures 2 and 3 compare the performance of the floating-point counters
with equivalent q-ary counters in simulation experiments. The equivalence is
manifest in Figure 2, which illustrates the trajectories of the estimates by the
different counters. Figure 3 plots statistics about the estimates across multiple
experiments: the estimators are clearly unbiased, and the two counters display
the same accuracy.

Figure 3: Distribution of the estimates for a floating-point counter (top) and
a comparable q-ary counter (bottom). Each plot depicts the result of 1000
experiments, in which a floating-point counter with $d = 4$-bit mantissa, and
a q-ary counter with $q = 2^{1/16}$ were run until $n = 100{,}000$. The dots in the
middle follow the averages; the black segments depict the standard deviations
(for each $\sigma$, they are of length $\sigma$, spaced at $\sigma$ from the average), and grey dots
show outliers that differ by more than $\pm 2\sigma$ from the average. The shading
highlights the asymptotic relative accuracy of the q-ary counter ($\approx 0.59 \cdot 2^{-d/2}$).